Our dataframe has the following structure:
RecordID datecode Name Betha RF MKTRF Ri
60 1 2014-12-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -0.06 1.61
61 1 2015-01-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -3.11 -3.53
62 1 2015-02-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 6.13 5.49
63 1 2015-03-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -1.12 0.29
64 1 2015-04-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 0.59 0.67
65 1 2015-05-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 1.36 0.57
392035 3267 2019-07-01 Wasatch Core Growth Institutional 0.6421722 0.0019 1.19 6.75
392036 3267 2019-08-01 Wasatch Core Growth Institutional 0.6421722 0.0016 -2.58 0.09
392037 3267 2019-09-01 Wasatch Core Growth Institutional 0.6421722 0.0018 1.44 4.99
392038 3267 2019-10-01 Wasatch Core Growth Institutional 0.6421722 0.0015 2.06 -3.68
392039 3267 2019-11-01 Wasatch Core Growth Institutional 0.6421722 0.0012 3.87 5.35
392040 3267 2019-12-01 Wasatch Core Growth Institutional 0.6421722 0.0014 2.77 1.12
We need to compute yearly Jensen's Alpha and Fama & French 3-factor alphas and store them in separate columns in order to run regressions on them. The formulas for both regressions are illustrated below:
Jensen Alpha: Ri ~ a + B1*MKTRF + e
3-factor Alpha: Ri ~ a + B1*MKTRF + B2*SMB + B3*HML + B4*UMD + e
We have tried saving the data as a data.table and in a panel-data format, and running the following regression from a similar post to compute the Jensen's alpha:
dt[, alpha:= roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], width = 12L)$coefs[, 1], by = RecordID]
The post: Rollregres with multiple regression and panel data
However, it did not work and kept giving the error message:
Error in roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], :
subscript out of bounds
We are trying to use the "rollRegres" package, and no additional packages have been used.
What are we doing wrong, and can anybody help us compute the yearly Jensen's alpha so that we can store it in a separate column? :)
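For reference, here is a minimal sketch of one way to set up this rolling fit with data.table, under the assumption (not confirmed by the error message alone) that the "subscript out of bounds" error comes from RecordID groups with fewer observations than the 12-month window; such groups are filled with NA instead:
library(data.table)
library(rollRegres)
# Sketch, not a verified fix: order by fund and date, then skip groups that are
# shorter than the rolling window.
setorder(dt, RecordID, datecode)
dt[, alpha := if (.N >= 12L)
       roll_regres.fit(x = cbind(1, MKTRF), y = Ri, width = 12L)$coefs[, 1]
     else
       rep(NA_real_, .N),
   by = RecordID]
# The factor-model alpha follows the same pattern with a wider design matrix,
# e.g. cbind(1, MKTRF, SMB, HML), assuming those columns are present in dt.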
I have a panel of stock data and would like to calculate the beta of the stocks. Amongst these stocks I have included the SP500 data, which would of course be the market data that I use to calculate the beta for each stock.
I would like to calculate a weekly beta for each of the stocks using the following formula:
β_{i,t} = cov(R_{i,d∈t}, R_{m,d∈t}) / var(R_{m,d∈t})
where R = returns, i = asset, d = day, t = week, m = market (SP500)
ave_beta <- df %>%
group_by(asset, week) %>%
summarize(ave_beta = ??? ) %>%
ungroup()
This is my code so far; however, I am not sure how to bring the SP500 data into the calculation, since I effectively want a weekly beta calculated from daily data for each asset.
EDIT:
Example of my data below:
date open high low close volume asset week lclose ret logclose
2 2020-06-29 4.36 4.41 4.31 4.35 30270600 NOK 1 4.34 0.002301497 1.470176
3 2020-06-30 4.30 4.41 4.30 4.40 21440300 NOK 1 4.35 0.011428696 1.481605
4 2020-07-01 4.36 4.41 4.34 4.35 24700300 NOK 1 4.40 -0.011428696 1.470176
5 2020-07-02 4.43 4.50 4.43 4.44 26398700 NOK 1 4.35 0.020478531 1.490654
6 2020-07-06 4.63 4.75 4.59 4.60 56364200 NOK 2 4.44 0.035401927 1.526056
7 2020-07-07 4.57 4.58 4.30 4.31 58948300 NOK 2 4.60 -0.065118399 1.460938
date open high low close volume asset week lclose ret logclose
7890 2021-06-18 201.3928 204.8400 194.5100 199.19 6663813 MRNA 51 202.47 -0.0163325843 5.294259
7891 2021-06-21 200.8600 211.0400 200.0000 208.24 7423063 MRNA 52 199.19 0.0444321176 5.338691
7892 2021-06-22 210.7600 222.4048 210.2501 221.36 10066344 MRNA 52 208.24 0.0610990748 5.399790
7893 2021-06-23 219.9800 224.5700 205.5500 212.04 14558440 MRNA 52 221.36 -0.0430153994 5.356775
7894 2021-06-24 214.3800 221.4900 213.4600 220.14 8171884 MRNA 52 212.04 0.0374887715 5.394264
7895 2021-06-25 221.2600 226.5100 216.3300 219.94 13315616 MRNA 52 220.14 -0.0009089257 5.393355
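Not a confirmed solution, but a minimal sketch under the assumption that the market rows are identified by asset == "SP500" (that identifier does not appear in the excerpt, so adjust it to your data): split out the market returns, join them back onto every asset by date, and compute cov/var within each asset-week.
library(dplyr)
# Separate the market (SP500) daily returns.
market <- df %>%
  filter(asset == "SP500") %>%
  select(date, mret = ret)
# Join the market return onto each asset by date, then compute the weekly beta
# as cov(asset return, market return) / var(market return) over each week.
ave_beta <- df %>%
  filter(asset != "SP500") %>%
  inner_join(market, by = "date") %>%
  group_by(asset, week) %>%
  summarize(ave_beta = cov(ret, mret) / var(mret), .groups = "drop")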
I would like to assign a variable with a custom factor from an ANOVA model to the emmeans() statement. Here I use the oranges dataset from the emmeans package to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store <- as.factor(oranges$store)
model <- lm(sales1 ~ 1 + price1 + store, data = oranges)
means <- emmeans(model, pairwise ~ store, adjust = "tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I try to evaluate this variable inside the emmeans() function it returns an error: it basically does not find the variable lsmeanfact, so the variable is not evaluated.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code to be able to evaluate the variable lsmeanfact so that the lsmeans for store are correctly calculated?
You can make use of the reformulate() function.
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
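As a quick check (not part of the original answer), printing both constructions shows that they yield the same formula object:
reformulate(lsmeanfact, 'pairwise')
#> pairwise ~ store
formula(paste('pairwise', lsmeanfact, sep = '~'))
#> pairwise ~ store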
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
Moreover, in any case, what is needed in specs is the name(s) of the factors involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.
I am trying to do a meta-analysis using hazard ratios with lower and upper 95% confidence limits, but for the CARDIa study, for example, the calculated 95% CI ([2.1560; 9.9858]) differs from the original values (1.33-6.16), and I do not know how to reproduce the exact numbers.
Any advice will be greatly appreciated.
Used code:
library(meta);library(metafor)
data<-read.table(text="studlab HR LCI UCI
Blazek 1.78 0.84 3.76
PRECOMBAT 1.20 0.37 3.93
LE.MANS 1.14 0.3 4.25
NOBLE 2.99 1.66 5.39
MASS-II 2.90 1.39 6.01
CARDIa 4.64 1.33 6.16
BEST 2.75 1.16 6.54
", header=T, sep="")
metagen(log(HR), lower = log(LCI), upper = log(UCI),
studlab = studlab,data=data, sm = "HR")
Obtained results
HR 95%-CI %W(fixed) %W(random)
Blazek 1.7800 [0.8413; 3.7659] 16.4 16.5
PRECOMBAT 1.2000 [0.3682; 3.9109] 6.6 7.1
LE.MANS 1.1400 [0.3029; 4.2908] 5.2 5.7
NOBLE 2.9900 [1.6593; 5.3878] 26.6 25.0
MASS-II 2.9000 [1.3947; 6.0301] 17.2 17.2
CARDIa 4.6400 [2.1560; 9.9858] 15.7 15.8
BEST 2.7500 [1.1582; 6.5297] 12.3 12.7
Number of studies combined: k = 7
HR 95%-CI z p-value
Fixed effect model 2.5928 [1.9141; 3.5122] 6.15 < 0.0001
Random effects model 2.5695 [1.8611; 3.5477] 5.73 < 0.0001
Quantifying heterogeneity:
tau^2 = 0.0181 [0.0000; 0.9384]; tau = 0.1347 [0.0000; 0.9687];
I^2 = 9.4% [0.0%; 73.6%]; H = 1.05 [1.00; 1.94]
Test of heterogeneity:
Q d.f. p-value
6.63 6 0.3569
Details on meta-analytical method:
- Inverse variance method
- DerSimonian-Laird estimator for tau^2
- Jackson method for confidence interval of tau^2 and tau
The CI output matches the original CI to 2 decimal places in all studies except for CARDIa, which I think has been incorrectly entered (forgive me if I'm wrong but I can't see any other explanation).
You can see this by calculating the standard errors manually and then recalculating the confidence intervals, much like the metagen function does.
library(meta)
se <- meta:::TE.seTE.ci(log(data$LCI), log(data$UCI))$seTE; se
#[1] 0.3823469 0.6027896 0.6762603 0.3004463 0.3735071 0.3910526 0.4412115
data$lower <- round(exp(ci(TE=log(data$HR), seTE=se)$lower), 3)
data$upper <- round(exp(ci(TE=log(data$HR), seTE=se)$upper), 3)
data
studlab HR LCI UCI lower upper
1 Blazek 1.78 0.84 3.76 0.841 3.766 #
2 PRECOMBAT 1.20 0.37 3.93 0.368 3.911 #
3 LE.MANS 1.14 0.30 4.25 0.303 4.291 #
4 NOBLE 2.99 1.66 5.39 1.659 5.388 #
5 MASS-II 2.90 1.39 6.01 1.395 6.030 #
6 CARDIa 4.64 1.33 6.16 2.156 9.986 # <- this one is incorrect.
7 BEST 2.75 1.16 6.54 1.158 6.530 #
The correct 95% CI for CARDIa should be around (2.16 - 9.99). I would verify that you have typed the values correctly.
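As a sketch of that reasoning (not part of the original answer): for a Wald-type confidence interval, the log hazard ratio should sit at the midpoint of the log CI limits, which it does for every study except CARDIa.
# Compare log(HR) with the midpoint of the log CI limits for each study.
with(data, data.frame(studlab,
                      logHR = log(HR),
                      midpoint = (log(LCI) + log(UCI)) / 2))
# For CARDIa, log(4.64) is about 1.53 while (log(1.33) + log(6.16)) / 2 is about
# 1.05, so the reported limits cannot come from a symmetric CI around log(HR).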
When I use cross-validation with my data it gives me two types of predictions, cvpred and Predicted. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total, form.lm = formula(verim ~ X4 + X4.1), m = 5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help page of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations; in other words, these are predictions made on your "training" data, or made "in sample".
To check whether this is so, you can fit the same model using lm():
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, cv.lm()'s Predicted values were the same as those from predict(lm()). So in that case they are in-sample predictions, where the model is fitted to and evaluated on the same observations.
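Here is a small sketch of that check on a built-in dataset (the formula is just an illustration, not the model from the question):
library(DAAG)
# Fit the same model with lm() and with cv.lm(), then compare lm()'s in-sample
# predictions with the 'Predicted' column returned by cv.lm().
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
cv_out <- cv.lm(data = iris, form.lm = formula(Sepal.Length ~ Sepal.Width), m = 3)
all.equal(unname(predict(fit)), cv_out$Predicted)
# Expected to be TRUE: 'Predicted' is the fit from all observations, while
# 'cvpred' holds the out-of-fold (cross-validated) predictions.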
lm() does not give "better results." I am not sure how predict() and cv.lm() can be the same. predict() returns the expected values of Y for each sample, estimated from the fitted model (the covariates X and their corresponding estimated beta values). Those beta values, and the model error (E), were estimated from the original data, so by using predict() you get an overly optimistic estimate of model performance; that is why it seems better. You get a better (more realistic) estimate of model performance using an iterated sample-holdout technique such as cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (prediction error) comes from 2-fold (K = 2) CV.
I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). The head of my dataset is shown below:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why is earth identifying 13169 predictors when there are actually only 10? It seems that MARS is treating individual observations of the candidate predictors as predictors in their own right. How can I prevent MARS from doing so?
Thanks for your help
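One common cause (an assumption based on the output, not a confirmed diagnosis) is that some predictor columns were imported as character or factor, in which case earth() expands every distinct value into its own dummy variable, producing terms like LST_f128.68 and an inflated predictor count. A sketch of how to check and correct that:
library(earth)
# Check whether any predictors came in as character/factor rather than numeric.
str(CE)
# If so, coerce them to numeric before fitting (column names taken from the post).
num_cols <- c("NDVI_l1", "NDVI_f0", "NDVI_f1", "NDVI_f2", "NDVI_f3",
              "LST_l1", "LST_f0", "LST_f1", "LST_f2", "LST_f3")
CE[num_cols] <- lapply(CE[num_cols], function(x) as.numeric(as.character(x)))
mymodel.mod <- earth(D9_RTO_avg ~ ., data = CE, nk = 10)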